Abstract. A binary sequence of length $n$ with $w$ ones can be identified
by its lexicographical rank in the set of all binary sequences with the same
number of ones and zeros, which is of size $\frac{n!}{w!\,(n-w)!}$. Although that
enumeration has been deeply studied for the binary case, it is less addressed for
$\sigma$-ary sequences, where $\sigma > 2$.

Assuming $n$ is a fixed predetermined parameter, the enumerative coding
of a given $n$-symbol long sequence $S$ over alphabet $\Sigma = \{e_1, e_2, \ldots, e_\sigma\}$
consists of first specifying the frequencies of the characters in $S$, denoted
by $C = (c_1, c_2, \ldots, c_\sigma)$, followed by indicating the rank of $S$ among all
possible sequences with the same $C$ vector. The required code lengths
are $(\sigma-1)\lceil\log(n+1)\rceil$ bits for $C$ and
$\lceil\log\frac{n!}{c_1!\,c_2!\cdots c_\sigma!}\rceil$ bits for the rank.

This study focuses on efficient schemes for enumerative coding of $\sigma$-ary
sequences by mainly borrowing ideas from Oktem & Astola's [1] hierarchical
enumerative coding and Schalkwijk's [2] asymptotically optimal
combinatorial code on binary sequences.

By observing that the number of distinct $\sigma$-dimensional vectors having
an inner sum of $n$, where the values in each dimension are in range $[0 \ldots n]$,
is $K(\sigma,n) = \sum_{i=0}^{\sigma-1}\binom{n-1}{\sigma-1-i}\binom{\sigma}{i}$, we propose representing the $C$ vector via
enumeration, and present the necessary algorithms to perform this task. We
prove that $\log K(\sigma,n)$ requires approximately $(\sigma-1)\log(\sigma-1)$ fewer bits than
the naive $(\sigma-1)\lceil\log(n+1)\rceil$ representation for relatively large $n$, and
examine the results for varying alphabet sizes experimentally.
We extend the basic scheme for the enumerative coding of $\sigma$-ary
sequences by introducing a new method for large alphabets. We show
experimentally, on DNA sequences, that the newly introduced technique is
superior to the basic scheme.



1 Introduction 



Let $S$ be a finite sequence of length $n$, where symbols are drawn from alphabet
$\Sigma = \{e_1, e_2, \ldots, e_\sigma\}$ and the numbers of occurrences of the individual characters are
indicated by the vector $C = (c_1, c_2, \ldots, c_\sigma)$.

The $k$th-order empirical entropy of $S$ has been defined by Manzini [3] as

$$H_k(S) = \frac{1}{n}\sum_{w \in \Sigma^k} |w_S| \cdot H_0(w_S),$$

where $w_S$ represents the string formed by concatenation of all symbols following
the $k$-symbol long context $w$ in $S$, and $H_0$ is the zeroth-order entropy defined
by $H_0(s) = -\sum_{e \in \Sigma} P(e) \cdot \log P(e)$. Note that $P(e)$ is the probability of symbol
$e$ in $s$.

On the other hand, Grossi et al. [4] proposed the finite set entropy that
approximates the $k$-th order information content of $S$ as

$$H_k^{f}(S) = \frac{1}{n}\sum_{w \in \Sigma^k} \log\frac{c_w!}{\prod_{e \in \Sigma} c_{we}!},$$

where $c_w$ is the frequency of the $k$-symbol long context $w$ and, similarly, $c_{we}$ is the
observation count of symbol $e$ following context $w$ in $S$. The zeroth-order finite set
entropy of $S$ is then $H_0^{f}(S) = \frac{1}{n}\log\frac{n!}{c_1!\,c_2!\cdots c_\sigma!}$.

The idea behind finite set entropy is that S can be identified by its rank in 
the sorted list of all sequences having the same frequencies of individual symbols. 
Similarly, given the position index and the frequencies, it is possible to generate 
the original sequence S. 

Enumerative coding is more closely related to the finite set entropy, whereas
Huffman or arithmetic coding in practice follows the empirical entropy,
estimated via a Markovian approach.

The number of ones, $w$, in a given binary sequence of length $n$ can be specified
by using $\lceil\log(n+1)\rceil$ bits, as $w$ can take one of $(n+1)$ values in range $[0 \ldots n]$.
The list of all unique length-$n$ binary sequences that have exactly $w$ ones and
$n-w$ zeros is of size $\frac{n!}{(n-w)!\,w!}$. A sequence among this set can be identified
by indicating its lexicographical rank, which requires $\lceil\log\frac{n!}{(n-w)!\,w!}\rceil$ bits.
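The ranking idea can be sketched in a few lines of Python. This is a minimal illustration rather than the scheme of any cited work: the function name `binary_rank` is hypothetical, ranks are taken as zero-based, and '0' is assumed to sort before '1'.

```python
from math import comb


def binary_rank(bits: str) -> int:
    """Zero-based lexicographic rank of `bits` among all strings of the same
    length that contain the same number of ones ('0' sorts before '1')."""
    n = len(bits)
    ones_left = bits.count("1")
    rank = 0
    for pos, b in enumerate(bits):
        remaining = n - pos - 1
        if b == "1":
            # Strings sharing our prefix but holding '0' here come first; they
            # must place all `ones_left` ones in the `remaining` positions.
            rank += comb(remaining, ones_left)
            ones_left -= 1
    return rank


print(binary_rank("11001"))  # rank among the 10 strings with n = 5, w = 3
```

Together with $w$, the returned rank is enough to reconstruct the sequence, which is the property the bit counts above rely on.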

Lynch [5] and Davisson [6] proposed that scheme to represent binary
sequences, which is named enumerative or combinatorial coding. More general
frameworks have been investigated by Cover [7] and Rissanen [8]. A comparison
against arithmetic coding has been given by Cleary & Witten [9].

Although $\log\frac{n!}{(n-w)!\,w!}$ is optimum due to the entropy of a finite set, the
$\lceil\log(n+1)\rceil$ bits needed to code the weight information cause an overhead.
This overhead becomes more significant in the case of $\sigma$-ary sequences. The naive
encoding of the frequency vector $C$ requires $(\sigma-1)\lceil\log(n+1)\rceil$ bits assuming
$n$ is a fixed predetermined parameter, and $\sigma\lceil\log(n+1)\rceil$
bits when $n$ is variable. Several studies have addressed this fact and proposed
alternatives for the binary case. However, enumerative coding of sequences over large
alphabets has not received much attention to date.

Our Contribution. This study focuses on the enumeration of $\sigma$-ary sequences.
We show that the number of distinct $\sigma$-dimensional vectors having a constant
inner sum of $n$, where the values in each dimension are in range $[0 \ldots n]$, is
$K(\sigma,n) = \sum_{i=0}^{\sigma-1}\binom{n-1}{\sigma-1-i}\binom{\sigma}{i}$. We have noticed that Oktem & Astola [1] previously
indicated this number as $\binom{n+\sigma-1}{\sigma-1}$ (although we could not find the proof or a
derivation of the formula in their paper), which is in concordance with our finding
according to the Chu-Vandermonde identity [10]. We prove that $\log K(\sigma,n)$ requires
approximately $(\sigma-1)\log(\sigma-1)$ fewer bits than $(\sigma-1)\lceil\log(n+1)\rceil$ for relatively
large $n$, and examine the results for varying alphabet sizes experimentally.

We extend the basic scheme for the enumerative coding of $\sigma$-ary sequences by
introducing a new method that borrows ideas from Schalkwijk's [2] enumerative
coding of binary sequences via variable length blocks. We show experimentally,
on DNA sequences, that the newly introduced technique is superior to the basic
scheme.

The results indicated herein are of significance not only in terms of
enumerative coding, but also for related fields such as enumerative combinatorics
[11], text indexing [4, 12], information theoretic analyses of sequences, and
jumbled pattern matching [13].

The outline of the paper is as follows. We investigate the related previous
studies in section 2. We introduce the compact representation of frequency
vectors in section 3 with the necessary enumeration algorithms. In section 4 we
describe the proposed $\sigma$-ary enumeration scheme based on factorizing the input
sequence into variable-length blocks. We give the implementation details and
experimental results on DNA sequences in section 5 and conclude in section 6.

2 Related Work 

Since the studies concerning enumerative coding had generally focused on binary 
sequences, previous efforts for the efficient representation of weight w had also 
appeared mainly on the binary alphabet. 

In the scheme proposed by Schalkwijk [2] the sequence is partitioned into
variable length blocks, where each block contains either $w$ ones or $(n-w)$ zeros.
During the partitioning, the sequence is scanned from left to right and whenever 
either the specified number of zeros or ones is obtained, the process stops. The 
subsequence is then generated by concatenating the symbols since the last cut 
point and hypothetically padded with the remaining ones or zeros, so that its 
length becomes n. The rank in the lexicographically sorted list of all binary 
n-bits long sequences with w ones uniquely identifies that subsequence. The 
partitioning and coding continues in the same way until the end of the input is 
reached. 

As an example, assume the sequence is 1100100010100111, and the parameters
are $w = 3$, $n = 5$. The subsequences, with padded bits shown between
brackets, are 1100{1}, 100{11}, 010{11}, 100{11}, and 111{00}, which can be
denoted by their lexicographic ordering among the $\frac{5!}{3!\,2!} = 10$ possible cases
of 00{111}, 010{11}, 0110{1}, 0111{0}, 100{11}, 1010{1}, 1011{0}, 1100{1},
1101{0}, 111{00}. In the decoding phase, fixed-length code words are read and
mapped to their corresponding sequences. It is easy to detect the padded bits
with the same factorization rule: once $w$ ones or $(n-w)$
zeros are observed in a subsequence, the rest of the
string must consist only of ones or only of zeros that are not part of the original
sequence.
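The partitioning in this example can be reproduced with a short Python sketch. The function names `schalkwijk_blocks` and `codebook` are hypothetical, the code book is built by brute force only because $n$ is tiny here, and the handling of a trailing incomplete block is left open since the text does not specify it.

```python
def schalkwijk_blocks(bits, n, w):
    """Split `bits` into variable-length blocks that end as soon as they hold
    w ones or n - w zeros; return (body, padding) pairs."""
    blocks, start, ones, zeros = [], 0, 0, 0
    for pos, b in enumerate(bits):
        if b == "1":
            ones += 1
        else:
            zeros += 1
        if ones == w or zeros == n - w:
            # Hypothetical padding: the ones or zeros still missing to reach length n.
            padding = "1" * (w - ones) + "0" * (n - w - zeros)
            blocks.append((bits[start:pos + 1], padding))
            start, ones, zeros = pos + 1, 0, 0
    return blocks  # a trailing incomplete block, if any, is ignored in this sketch


def codebook(n, w):
    """All n-bit strings with w ones, in lexicographic order (brute force)."""
    return sorted(s for s in (format(x, f"0{n}b") for x in range(2 ** n))
                  if s.count("1") == w)


blocks = schalkwijk_blocks("1100100010100111", n=5, w=3)
book = codebook(5, 3)
ranks = [book.index(body + pad) for body, pad in blocks]
print(blocks)
print(ranks)
```

Each rank fits in a fixed-length code word of $\lceil\log 10\rceil = 4$ bits, while the block bodies themselves have lengths 4, 3, 3, 3, and 3.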

The length of the code words is fixed and the blocks are variable in
Schalkwijk's proposal. Hence, $w$ and $n$ are predetermined parameters that do not
need to be coded separately in each block, which is an advantage. On the
other hand, there is an overhead due to the padded bits. Nevertheless, the
enumeration scheme of Schalkwijk is shown to be asymptotically optimum for binary
memoryless channels [2]. Another difficulty in practice is finding the optimum
values of $w$ and $n$.

Oktem & Astola [1] introduced hierarchical enumerative coding for binary
sequences. They divide the sequence into smaller subsequences. The main idea is
that once the total number of ones in the larger sequence is known, the weights
of the smaller blocks can be coded more efficiently. For example, we divide
the sample binary sequence 1100100010100111 into four equal length blocks as
1100, 1000, 1010, 0111, where the weights of the blocks are 2, 1, 2, and 3,
respectively. If we know in advance that the main sequence is of length 16 with a total
weight of 8 and each substring is 4 bits long, then we can count the ways that
4 non-negative integers sum up to 8, and simply represent the vector (2, 1, 2, 3)
via its rank in that set. They proposed to perform this in a tree structure, where
the weights of the child nodes are determined by their ancestors (see [14, 15] for
detailed discussions), hence the name hierarchical enumeration.
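The count used in this example can be checked numerically. The sketch below follows the text and counts all ways that 4 non-negative integers sum to 8, without exploiting the fact that a 4-bit block has weight at most 4. The naive cost of $\lceil\log 5\rceil$ bits per weight is an assumption made here for comparison and is not taken from [1].

```python
from itertools import product
from math import ceil, comb, log2

total, parts, block_len = 8, 4, 4

# Closed form (stars and bars) and a brute-force cross-check.
count_formula = comb(total + parts - 1, parts - 1)
count_brute = sum(1 for v in product(range(total + 1), repeat=parts)
                  if sum(v) == total)

# Assumed naive cost: each weight in 0..block_len coded on its own.
naive_bits = parts * ceil(log2(block_len + 1))
# Cost of one rank among all candidate weight vectors.
enum_bits = ceil(log2(count_formula))

print(count_formula, count_brute, naive_bits, enum_bits)
```

Under these assumptions the rank of the weight vector needs fewer bits than coding the four weights separately.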

Another study that aims to reduce the overhead of transmitting the weight
parameter has been presented by Dai & Zakhor [16]. In their work, the authors
proposed to model the distribution of $w$ values via a Poisson process, and then
code the individual weights by Huffman coding according to their probabilities in
the estimated distribution.



3 Enumerating the frequency vectors 

The overhead in enumerative coding of $S$ is the information required to represent
the individual symbol frequencies. Assuming that the length $n$ of $S$ is
predetermined, the naive way of representing the frequency vector $C = (c_1, c_2, \ldots, c_\sigma)$
requires $(\sigma-1)\log(n+1)$ bits. Notice that $\sigma-1$ items are enough, as we can
compute the last frequency by subtracting the total frequency of the remaining
items from $n$.

In this study we propose to represent the $C$ vector via enumeration. Theorem 1,
together with Lemma 1 below, counts the distinct $\sigma$-dimensional vectors whose
components sum to a predefined value. In other words, we count the ways that
$\sigma$ counters sum up to a specified value when we know in advance that each counter
is greater than or equal to zero.

Lemma 1. The number of ways that $K$ counters, whose values are positive
integers, sum up to $N$ is

$$KsumN(K, N) = \binom{N-1}{K-1} \qquad (1)$$

Proof. Assume we have a chain of $N$ balls connected by single wires. Obviously
we have $N-1$ connections. When we cut $K-1$ of them, we will be left with
$K$ partitions. The number of balls in each partition identifies the value of the
corresponding counter. Since this selection can be performed in $\binom{N-1}{K-1}$ ways, this
is the total number of distinct ways that $K$ counters sum up to $N$.



Theorem 1. Let $V = (v_1, v_2, \ldots, v_\sigma)$ be a $\sigma$-dimensional vector of integers,
where $v_i \ge 0$ for all $1 \le i \le \sigma$ and $\sum_i v_i = n$. The number of such distinct
vectors is

$$K(\sigma, n) = \sum_{i=0}^{\sigma-1}\binom{n-1}{\sigma-i-1}\binom{\sigma}{i} = \binom{n+\sigma-1}{\sigma-1} \qquad (2)$$

Proof. Among the $\sigma$ dimensions of $V$, at least one of them is greater than zero
and at most all of them are non-zero. If we represent the number of zero values by
$i$, the summation iterates over $i$ from 0 to $\sigma-1$. The second binomial coefficient
$\binom{\sigma}{i}$ in the summation counts the ways that we can choose the $i$ zero-valued
dimensions among $\sigma$, whereas the first binomial coefficient $\binom{n-1}{\sigma-i-1}$ counts the distinct ways
that $n$ items can be partitioned into $\sigma-i$ urns (Lemma 1 with $K = \sigma-i$ and $N = n$).
By the Chu-Vandermonde identity [10] this summation becomes $\binom{n+\sigma-1}{\sigma-1}$.
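The two expressions for $K(\sigma, n)$ can be cross-checked against a brute-force count for small parameters. The Python sketch below is only a sanity check under the assumption $n \ge 1$; the helper names `k_sum`, `k_closed`, and `k_brute` are hypothetical.

```python
from itertools import product
from math import comb


def k_sum(sigma, n):
    """Summation form: i zero-valued dimensions, the rest positive (n >= 1)."""
    return sum(comb(n - 1, sigma - i - 1) * comb(sigma, i) for i in range(sigma))


def k_closed(sigma, n):
    """Closed form obtained via the Chu-Vandermonde identity."""
    return comb(n + sigma - 1, sigma - 1)


def k_brute(sigma, n):
    """Direct count of sigma-dimensional vectors over 0..n that sum to n."""
    return sum(1 for v in product(range(n + 1), repeat=sigma) if sum(v) == n)


for sigma in range(1, 5):
    for n in range(1, 7):
        assert k_sum(sigma, n) == k_closed(sigma, n) == k_brute(sigma, n)

print(k_closed(4, 4))  # number of 4-dimensional vectors with inner sum 4
```

The brute-force count grows as $(n+1)^\sigma$, so it is usable only for very small cases, whereas the closed form is a single binomial coefficient.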

Lemma 2. Let $V = (v_1, v_2, \ldots, v_\sigma)$ be a $\sigma$-dimensional vector of integers, where
$v_i \ge 0$ for all $1 \le i \le \sigma$ and $\sum_i v_i = n$. The enumerative coding of such a vector
requires approximately $(\sigma-1)\log(\sigma-1)$ bits less than the naive representation of
$(\sigma-1)\log(n+1)$ bits when $\sigma > 2$ and $n$ is relatively larger than $\sigma$.

Proof. According to Stirling's approximation $z! \approx \sqrt{2\pi z}\left(\frac{z}{e}\right)^{z}$,
$K(\sigma,n) = \frac{(n+\sigma-1)!}{n!\,(\sigma-1)!}$ can be estimated as

$$K(\sigma,n) \approx \sqrt{\frac{n+\sigma-1}{2\pi n(\sigma-1)}} \cdot \frac{(n+\sigma-1)^{n+\sigma-1}}{n^{n} \cdot (\sigma-1)^{\sigma-1}} \qquad (3)$$

Rearranging the terms in equation (3) yields

$$K(\sigma,n) \approx \sqrt{\frac{n}{2\pi n(\sigma-1)} + \frac{\sigma-1}{2\pi n(\sigma-1)}} \cdot \left(\frac{n+\sigma-1}{n}\right)^{n} \cdot \left(\frac{n+\sigma-1}{\sigma-1}\right)^{\sigma-1} \qquad (4)$$

The square-root term is less than one, and hence, the logarithm of $K(\sigma,n)$
can be approximated as

$$\log K(\sigma,n) \approx n[\log(n+\sigma-1) - \log n] + (\sigma-1)[\log(n+\sigma-1) - \log(\sigma-1)]$$
$$= (n+\sigma-1)\log(n+\sigma-1) - n\log n - (\sigma-1)\log(\sigma-1) \qquad (5)$$

Assuming $\log(n+\sigma-1) \approx \log n$ when $\sigma > 2$ and $n$ is relatively larger than
$\sigma$, equation (5) becomes

$$\log K(\sigma,n) \approx (n+\sigma-1)\log n - n\log n - (\sigma-1)\log(\sigma-1)$$
$$= (\sigma-1)\log n - (\sigma-1)\log(\sigma-1) \qquad (6)$$

Thus, considering $\log(n+1) \approx \log n$, the enumerated representation of a
frequency vector is approximately $(\sigma-1)\log(\sigma-1)$ bits less than the naive
$(\sigma-1)\log(n+1)$ representation.
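The bit counts compared in Lemma 2 can be evaluated exactly with arbitrary-precision integers. The sketch below is an illustration only, with hypothetical helper names. It prints the exact gain next to the approximate value from the lemma, so the quality of the approximation can be inspected for a chosen $\sigma$ and $n$; logarithms are taken to base 2.

```python
from math import comb, log2


def enum_bits(sigma, n):
    """Exact log2 K(sigma, n), using the closed form of Theorem 1."""
    return log2(comb(n + sigma - 1, sigma - 1))


def naive_bits(sigma, n):
    """Naive cost of sigma - 1 counters, each in the range 0..n."""
    return (sigma - 1) * log2(n + 1)


def approx_gain(sigma):
    """Approximate gain (sigma - 1) log2(sigma - 1) stated by the lemma."""
    return (sigma - 1) * log2(sigma - 1)


n = 1000
for sigma in (4, 20, 128, 256):
    gain = naive_bits(sigma, n) - enum_bits(sigma, n)
    print(sigma, round(gain, 1), round(approx_gain(sigma), 1))
```

The alphabet sizes and the value of $n$ used here mirror those mentioned for the experimental comparison; any other values can be substituted.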



Figure 1 compares the number of bits required in enumerative coding of vectors
versus the naive representation for alphabet sizes of 4, 20, 128, and 256. For varying
sequence lengths $n$, where $n < 1000$, the $(\sigma-1)\log n$ versus $\log K(\sigma,n)$ values
are plotted. The advantage obtained by the proposed enumerative coding
becomes more significant with increasing alphabet size, and the gain tends to
converge to the theoretical value $(\sigma-1)\log(\sigma-1)$ as stated by Lemma 2.

Fig. 1. Comparison of enumerated representation of frequency vectors versus naive
representation in varying alphabet sizes.



The enumerative coding of $C = (c_1, c_2, \ldots, c_\sigma)$ requires a ranking among
the $\sigma$-dimensional vectors whose inner sum is a predetermined value $n$. Once
we have a mechanism to sort all such vectors, we can identify any of them by
indicating its rank among all possibilities.

We devise such a ranking, where $c_1$ and $c_{\sigma-1}$ are the most and least significant
positions, respectively. The $c_\sigma$ value does not have an effect on the ordering since
it is not a free variable: its value is determined by $n - \sum_{1 \le i < \sigma} c_i$.

The individual vectors are enumerated in ascending order according to
their $c_i$ values. Table 1 shows this ranking on an example case in which we
consider 4-dimensional vectors having an inner sum of 4. Notice that there exist
$K(4,4) = 35$ such vectors.

The algorithm Vector2Index shown in Figure 2 returns the rank of a given
vector among the list of all vectors having the same inner sum. We begin with
the most significant dimension $c_1$, and add to the index $i$ the number of vectors
whose first dimension is less than $c_1$. Following the subtraction of the $c_1$ value
from the inner sum $\ell$ and decrementing the dimension by 1, we continue with the
next dimension $c_2$. We repeat the process in the same way for all dimensions
except the last position, since $c_\sigma$ has no effect on the
index calculation, as explained previously.

Table 1. 4-dimensional vectors, whose inner sums are 4, ordered according to the
proposed method.



Vector2Index
Input: C = (c_1, c_2, ..., c_σ)
Output: Rank of C among all vectors with the same inner sum.
    i ← 0
    ℓ ← c_1 + c_2 + ... + c_σ
    for dim ← 1 to σ − 1 do
        for j ← 0 to c_dim − 1 do
            i ← i + K(σ − dim, ℓ − j)
        ℓ ← ℓ − c_dim
    return i

Index2Vector
Input: i, ℓ, σ
Output: The vector C = (c_1, c_2, ..., c_σ) ranked i-th in the list of all vectors
        whose inner sum is ℓ
    for dim ← 1 to σ − 1 do
        c_dim ← 0
        while (ℓ > 0) do
            t ← K(σ − dim, ℓ)
            if i ≥ t
                i ← i − t
                c_dim ← c_dim + 1
                ℓ ← ℓ − 1
            else break
    c_σ ← ℓ
    return C = (c_1, c_2, ..., c_σ)

Fig. 2. The algorithms to find the index of a given vector (Vector2Index) and to
generate the vector given its index, inner sum, and dimension (Index2Vector).



As an example, let us follow the algorithm to find the index of $C = (2, 1, 1, 0)$.
We first need to compute the number of vectors whose first dimension is 0, which
is equal to the number of distinct 3-dimensional vectors whose inner sum is
$4 - 0 = 4$. In the same way we count the vectors having $c_1 = 1$, which corresponds
to the number of 3-dimensional vectors whose values sum up to
$4 - 1 = 3$. Those are calculated as $K(3,4) = 15$ and $K(3,3) = 10$,
respectively. After being done with $c_1$, we proceed to the next dimension with
$c_2 = 1$. However, before continuing with the second item, we decrement the inner
sum from 4 to 2, since the first dimension is now fixed to 2. For the second dimension,
we need to consider the 2-dimensional vectors summing up to $2 - 0 = 2$, whose count is
$K(2,2) = 3$. Before visiting the last position we set the inner sum to $4 - 2 - 1 = 1$
and the dimension to 1. The number of one-dimensional vectors having a value less than
$c_3 = 1$ is $K(1,1) = 1$. Thus, the index of the input $C$ is $15 + 10 + 3 + 1 = 29$.

The reverse Index2Vector algorithm in Figure 2 generates the vector when
the index, the inner sum, and the dimension of the vector are given. The
computation of the values of the vector begins with the most significant position
$c_1$. We first count all vectors that have the value 0 at $c_1$, which is computed by
$K(\sigma-1, \ell)$. If that count is less than or equal to our input index $i$, we decrement
$i$ by that count and continue with calculating the number of vectors that have a
value of 1 at $c_1$, which is equal to $K(\sigma-1, \ell-1)$. We proceed in the same
way until $i$ becomes less than the number of vectors having some value at $c_1$;
that value is the first dimension value we are seeking. After
setting $c_1$, we do the same processing for $c_2$, but we reduce the dimension
by one and take the inner sum to be $\ell - c_1$. All values of the vector $C$
are calculated by repeating the procedure similarly.

The time complexities of Vector2Index and Index2Vector are
$O(c_1 + c_2 + \ldots + c_{\sigma-1})$ and $O(\ell)$, respectively. Note that $\ell = c_1 + c_2 + \ldots + c_\sigma$
in Index2Vector, and hence both algorithms are linear in the inner sum of
the vector that is being enumerated.
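The two procedures translate almost line by line into Python. The sketch below is an illustrative transcription, not the code of the library mentioned in section 5. The function names are hypothetical, ranks are assumed to be zero-based, and $K$ is computed with the closed form of Theorem 1.

```python
from math import comb


def K(sigma, n):
    """Number of sigma-dimensional non-negative integer vectors summing to n."""
    return comb(n + sigma - 1, sigma - 1)


def vector_to_index(c):
    """Zero-based rank of c among all vectors with the same length and sum."""
    sigma, remaining, index = len(c), sum(c), 0
    for dim in range(1, sigma):            # the last dimension is not a free variable
        for j in range(c[dim - 1]):        # vectors with a smaller value at this position
            index += K(sigma - dim, remaining - j)
        remaining -= c[dim - 1]
    return index


def index_to_vector(index, total, sigma):
    """Inverse of vector_to_index for vectors of dimension sigma summing to total."""
    c = [0] * sigma
    remaining = total
    for dim in range(1, sigma):
        while remaining > 0:
            t = K(sigma - dim, remaining)  # vectors keeping the current value here
            if index >= t:
                index -= t
                c[dim - 1] += 1
                remaining -= 1
            else:
                break
    c[sigma - 1] = remaining
    return tuple(c)


print(vector_to_index((2, 1, 1, 0)))
print(index_to_vector(29, 4, 4))
```

Iterating `index_to_vector(i, 4, 4)` for `i` from 0 to 34 lists all 35 vectors in the order defined by the ranking.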

4 Enumerating $\sigma$-ary sequences

Let $T = t_1 t_2 \ldots t_n$ denote a given sequence of length $n$, where the symbols are
drawn from alphabet $\Sigma = \{e_1, e_2, \ldots, e_\sigma\}$. We propose to factor $T$ into variable
length blocks such that symbol $e_a$ occurs $r$ times in each block. Symbol $e_a$ and $r$ are
predetermined fixed parameters of the proposed factorization.

Careful readers will realize that if $t_i t_{i+1} \ldots t_{i+j}$ is such a factor of $T$, then
$t_{i+j+1}$ is always equal to $e_a$. Thus, while coding the sequence we do not need
to consider the characters that immediately follow the factors, since they are known in
advance.

As an example, assume $T$ = ttgaacgagaagccgtatgaaatgaaaatatcac is a given
sequence of length 34 over alphabet $\Sigma = \{a, c, g, t\}$, where the predetermined
parameters of the proposed method are $e_a$ = a and $r = 2$. We pad $T$ with 2 a
symbols to make sure that we have a valid block obeying the rule at the end.
The factorization of $T$ is depicted in Figure 3. Note that the last two a's in block
$b_6$ do not belong to the original sequence, which can be determined in the decoding
phase once the length of $T$ is known. Hence, the proposed system requires
explicitly storing three parameters: the sequence length $n$, the selected symbol
$e_a$, and the repeat count $r$ of the selected symbol.

Remembering that $c_a$ indicates the occurrence count of $e_a$ in $T$, the number of
blocks with the proposed factorization is $B = \lfloor c_a/(r+1) \rfloor + 1$. Each block can
be specified with three components: the block length, the symbol frequency
vector of the block, and the permutation index of the block.
Fig. 3. Sample factorization of $T$ = ttgaacgagaagccgtatgaaatgaaaatatcac via the
proposed method with parameters $e_a$ = a, and $r = 2$.
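The factorization rule can be sketched as follows. This is an illustrative reading of the description above, with a hypothetical function name `factorize`: each block collects symbols until it holds $r$ occurrences of the selected symbol, and the next occurrence of that symbol closes the block without being stored.

```python
def factorize(text, symbol, r):
    """Split `text` (padded with r copies of `symbol`) into blocks that each
    contain r occurrences of `symbol`; the following occurrence is dropped."""
    padded = text + symbol * r
    blocks, current, seen = [], [], 0
    for ch in padded:
        if ch == symbol and seen == r:
            # This occurrence is implied by the rule, so it is not stored.
            blocks.append("".join(current))
            current, seen = [], 0
            continue
        if ch == symbol:
            seen += 1
        current.append(ch)
    if current:
        blocks.append("".join(current))
    return blocks


T = "ttgaacgagaagccgtatgaaatgaaaatatcac"
blocks = factorize(T, "a", 2)
for b in blocks:
    # Block, its length, and its frequency vector in the order (a, c, g, t).
    print(b, len(b), tuple(b.count(ch) for ch in "acgt"))
```

For this input the sketch yields six blocks, the last of which ends with the two padded a symbols.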



We suggest encoding the block lengths via arithmetic coding. When the
symbols are ideally identically distributed over $T$, the frequency of $e_a$ is nearly
$c_a \approx n/\sigma$. We expect to have block lengths mostly in between $r \cdot c_a$ and $(r+1) \cdot c_a$
symbols, and hence, they can be coded and decoded efficiently.

The frequency vector of each block can be enumerated with the methods
described in the previous section. Notice that the $c_a$ value is always $r$ in every block.
Thus, it can be excluded from the frequency vector, which means it is enough
to enumerate $\sigma-1$ values, whose total sum is $r$ less than the block length. For
example, in Figure 3 the first block is 7 symbols long and the corresponding
symbol frequencies are (2, 1, 2, 2). Excluding the first dimension, which is known to
be equal to 2 in advance, we need to enumerate (1, 2, 2). There are $K(3,5) = 21$
three-dimensional vectors having an inner sum of 5, which means we have to
spend $\lceil\log 21\rceil = 5$ bits to indicate the target vector.
Table 2. Lexicographically sorted list of all possible sequences over alphabet
$\Sigma = \{a, c, g, t\}$ with a sample frequency vector of $C = (2, 1, 1, 0)$.



When we know the frequencies of each symbol, a block may be indicated by
its rank among the lexicographically ordered list of all its permutations. The
algorithms to find the rank of a given sequence (Sequence2PermIndex) and
its reverse operation (PermIndex2Sequence), which generates the sequence given
the symbol frequencies and the rank, are given in Figure 4. The permutation
indices of the blocks shown in Figure 3 are computed according to these procedures.

In Sequence2PermIndex, we initially set the index to 0 and scan the input
sequence from left to right. At each position, we count how many possible strings
precede the input $S$ and add this value to the index. As an example, let us find the rank
of agca. The possible permutations can be viewed in Table 2. The first character a is the
least significant symbol of the alphabet, and hence no strings precede it. Next,
we count the strings with prefixes aa and ac, which makes the index
$2 + 2 = 4$. The last step is to count the strings with prefix aga, which is 1. Thus, we
compute that agca has rank 5 in a zero-based list.

PermIndex2Sequence is the reverse of this process. The time
complexities of both algorithms are $O(\ell)$, where $\ell$ is the length of the input/output
string.



Sequence2PermIndex
Input: Sequence S = s_1 s_2 ... s_ℓ
Output: The rank pid of S in the ordered list of all sequences
        with the same symbol frequencies
    pid ← 0
    for j ← 1 to σ do c_j ← 0
    for j ← 1 to ℓ do
        k ← symbolID(s_j)
        c_k ← c_k + 1
    for a ← 1 to ℓ do
        k ← symbolID(s_a)
        for j ← 1 to k − 1
            if (c_j > 0)
                c_j ← c_j − 1
                t ← (ℓ − a)! / (c_1! · c_2! · ... · c_σ!)
                pid ← pid + t
                c_j ← c_j + 1
        c_k ← c_k − 1
    return pid

PermIndex2Sequence
Input: pid, C = (c_1, c_2, ..., c_σ)
Output: The pid-th sequence S in the ordered list of all sequences
        with the same symbol frequencies
    ℓ ← c_1 + c_2 + ... + c_σ
    for a ← 1 to ℓ do
        temp ← 0
        for j ← 1 to σ do
            if (c_j > 0)
                c_j ← c_j − 1
                t ← (ℓ − a)! / (c_1! · c_2! · ... · c_σ!)
                temp ← temp + t
                if (pid ≥ temp)
                    c_j ← c_j + 1
                else
                    s_a ← j
                    temp ← temp − t
                    pid ← pid − temp
                    break
    return s_1 s_2 ... s_ℓ

Fig. 4. The algorithms to find the permutation index of a given sequence
(Sequence2PermIndex) and to generate the sequence given its permutation index
along with symbol frequencies (PermIndex2Sequence).
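A compact Python rendering of the two permutation-index procedures is sketched below. It is an illustration under stated assumptions rather than the library code: the function names are hypothetical, the alphabet is passed explicitly in sorted order, ranks are zero-based, and the number of arrangements of the remaining symbols is recomputed from factorials at every step for clarity rather than speed.

```python
from math import factorial


def arrangements(counts):
    """Number of distinct sequences with the given symbol counts (multinomial)."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def sequence_to_perm_index(seq, alphabet):
    """Zero-based lexicographic rank of seq among its distinct permutations."""
    counts = [seq.count(e) for e in alphabet]
    pid = 0
    for ch in seq:
        k = alphabet.index(ch)
        for j in range(k):                 # smaller symbols that could stand here
            if counts[j] > 0:
                counts[j] -= 1
                pid += arrangements(counts)
                counts[j] += 1
        counts[k] -= 1
    return pid


def perm_index_to_sequence(pid, counts, alphabet):
    """Inverse of sequence_to_perm_index for the given symbol counts."""
    counts, out = list(counts), []
    for _ in range(sum(counts)):
        for j in range(len(counts)):
            if counts[j] == 0:
                continue
            counts[j] -= 1
            t = arrangements(counts)       # sequences that place symbol j here
            if pid < t:
                out.append(alphabet[j])
                break
            pid -= t
            counts[j] += 1
    return "".join(out)


print(sequence_to_perm_index("agca", "acgt"))
print(perm_index_to_sequence(5, (2, 1, 1, 0), "acgt"))
```

Calling `perm_index_to_sequence(i, (2, 1, 1, 0), "acgt")` for `i` from 0 to 11 enumerates the twelve sequences with that frequency vector in lexicographic order.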



Notice that the permutation index of a block need not be coded when
the corresponding frequency vector includes only one non-zero dimension, which
means all symbols are the same within that block, and that symbol is the one that
is non-zero in the $C$ vector.

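Analysis: The enumerative encoding of block $b_i$, $1 \le i \le B$,
requires $\log |b_i|$ bits for the length (appearing as the $\log N$ term in equation (7)),
$\log K(\sigma-1, |b_i|)$ bits for the enumeration of the frequency
vector, and $H_0^{f}(b_i)$ bits for the permutation index, where $H_0^{f}(b_i)$ represents the
zero-order finite set entropy of the block, counted here in bits for the whole block.
Hence the total space requirement is

$$\sum_{i=1}^{B} \left[\log N + \log K(\sigma-1, |b_i|) + H_0^{f}(b_i)\right] \qquad (7)$$

It may be argued that fixed-length blocks should be used to remove the overhead of block
length encoding. In this case, we partition the sequence into $B'$ factors of length
$\frac{n}{B'}$, and the total number of bits required becomes

$$\sum_{i=1}^{B'} \left[\log K\!\left(\sigma, \frac{n}{B'}\right) + H_0^{f}(b_i)\right] \qquad (8)$$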

When we assume the number of blocks is set to be nearly equal in the fixed-
and variable-length models, although we remove the log term in the fixed-length
model, the frequency vector we need to encode in this case is $\sigma$-dimensional,
whereas it is $(\sigma-1)$-dimensional in the variable-length model. This trade-off is investigated
experimentally, and it is observed that the variable-length model performs better
than the fixed-length one on DNA sequences (see next section).

Another point is the selection of the parameters $e_a$ and $r$ for the proposed
variable-length model. If one chooses a frequent symbol, the blocks
become shorter and the number of blocks gets larger. Similarly, small $r$ values
increase the block count and vice versa. The trade-off between the block length and
the number of blocks is also analyzed experimentally on DNA sequences, where it is
observed that selecting rare symbols gives better results.

5 Implementation and experimental analysis on DNA 
sequences 

We have implemented an enumerative coding library in C++ that can be
freely downloaded from www.busillis.com/EnumCodeLib.zip. The GNU
multiple precision arithmetic library lets us overcome the big-number
problems that arise in factorial, combination, and permutation calculations. Using this
library we have also implemented the proposed variable-length factorization as
well as the fixed-length scheme, and computed the number of bits required with
each method on a series of DNA sequence files that are commonly used as the
benchmark corpus for DNA compression algorithms. The results are given in Table 3.





File name  Size    a      c      g      t      Finite-set  Fixed-Len.  Variable-Len.  Symbol  Avg. Block
                                               H0 Entropy  Entropy     Entropy        e_a     Length
chmpxx     121024  42896  17309  17556  43263  1.866       1.871       1.845          g       504
chntxx     155844  47824  29991  28992  49037  1.957       1.969       1.953          g       402
hehcmv     229354  49475  64911  66192  48776  1.985       1.975       1.977          ?       796
humdyst    38770   12001  7161   7011   12597  1.946       1.956       1.950          c       692
humghes    66495   17311  16271  16441  16472  1.999       2.028       1.999          a       493
humhbb     73308   22068  14146  14785  22309  1.967       1.972       1.956          t       424
humhdab    58864   13422  14846  15906  14690  1.997       1.999       1.977          t       510
humprtb    56737   15689  11281  11599  18168  1.970       1.990       1.963          g       630
mpomtcg    186609  53206  39215  39924  54264  1.983       1.998       1.983          c       614
mtpacga    100314  35804  13428  16724  34358  1.879       1.882       1.883          c       955
vaccg      191737  63921  32010  32030  63776  1.919       1.929       1.926          c       770
Average entropy in bits/base                   1.952       1.961       1.946

Table 3. Comparison of results obtained by fixed- and variable-length enumerative
coding on sample DNA sequences. The block length for the fixed-length scheme is
2048, whereas the r parameter is 128 for the variable-length scheme.
We tested windows of lengths 4, 8, 16, 32, ..., 2048 for the fixed-length scheme.
The best results were obtained with the longest size, 2048.
We followed a two-pass procedure for the variable-length enumeration. In the
first pass we calculated the block lengths, and in the second pass we
computed the frequency-vector enumerations as well as the permutation
indices of each block. The bits-per-base entropy values have been computed with

$$\frac{1}{n}\sum_{i=1}^{B}\left[\log N + \lceil\log K(\sigma-1, |b_i|)\rceil + \left\lceil\log\frac{|b_i|!}{c_a!\,c_c!\,c_g!\,c_t!}\right\rceil\right].$$

We used $r$ values of 4, 8, 16, 32, 64, and 128 for each possible $e_a \in \{a, c, g, t\}$. We observed the best
results with $r = 128$ and, usually, when $e_a$ is one of the least frequent symbols. The
corresponding average block lengths are given in the table.

The proposed variable-length model achieved better results than the naive
fixed-length model. Note that the entropy measured with the variable-length scheme
is on average better than the theoretical zero-order finite set
entropy of the target sequence. That is due to saving one symbol per block with
the introduced model, and to the skewed distribution of symbols in blocks, which
enabled us to identify them with fewer bits via permutation indices. Another
important point is that the proposed variable-length block enumeration scheme achieved
better results with shorter blocks than the naive fixed-length scheme, which
matters because computing the vector and permutation
indices takes more time on longer blocks.

6 Conclusions 

We have investigated the enumerative encoding of $\sigma$-ary sequences by
introducing a new method, where we factorize the input sequence into variable size blocks
and then represent each block via its length, the enumerated coding of its symbol
frequency vector, and the corresponding permutation index. We have described the
procedures required for vector-to-index and sequence-to-index
conversions along with their reverse operations for decoding. We have
devised experiments on DNA sequences to compare the performance of the
proposed variable-length enumeration against naive fixed-length enumeration with respect to the
zero-order finite set entropies of the sample sequences.

The enumeration library used in this study, which we have implemented via
the GNU multiple precision arithmetic library, is publicly available (footnote 3). We believe
that one of the main reasons for the relatively little attention paid to enumerative
coding is the lack of such publicly available resources. Notice that one
can find many libraries for arithmetic, Huffman, or dictionary-based Lempel-Ziv
factorizations, but not for enumerative coding.

It would be interesting to investigate more options on enumeration based 
approaches in data compression, as well as text indexing. Benefiting from enu- 
merative coding in information theoretic analyses of biological sequences is yet 
another direction. 

3 EnumCodeLib at www.busillis.com/EnumCodeLib.zip 